LightCode: Compiling LLM Inference for Photonic-Electronic Systems
Ryan Tomich, Zhizhen Zhong, Dirk Englund
The growing demand for low-latency, energy-efficient inference in large language models (LLMs) has catalyzed interest in heterogeneous architectures. While GPUs remain dominant, they are poorly suited for integration with emerging domain-specific accelerators such as Photonic Tensor Units (PTUs), which offer low-power, high-throughput linear computation. This motivates hybrid compilation strategies that combine photonic and electronic resources. We present LightCode, a compiler framework and simulator for mapping LLM inference workloads across hybrid photonic-electronic systems. LightCode introduces the Stacked Graph, an intermediate representation that encodes multiple hardware-specific realizations of each tensor operation. Hardware assignment is formulated as a constrained subgraph selection problem optimized for latency or energy under parametric cost models. We evaluate LightCode on the prefill stage of GPT-2 and Llama-7B, showing that under our workload and hardware assumptions, (i) photonic hardware reduced energy by up to 50% at maximum sequence length; (ii) the choice of multiplexing and assignment strategy yielded latency improvements exceeding 10x; and (iii) optimizing for latency or energy produced distinct hardware mappings. LightCode offers a modular, foundational framework and simulator for compiling LLMs to emerging photonic accelerators.
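To make the formulation concrete, here is a minimal sketch of hardware assignment over a stacked-graph-style representation: each tensor op carries one candidate realization per hardware target with a parametric (latency, energy) cost, and the compiler selects one realization per op under the chosen objective. This is an illustration, not LightCode's actual API; all names and cost numbers below are hypothetical.

```python
# Hypothetical sketch: each op has several hardware realizations; pick one
# per op to minimize latency or energy. A real compiler would also model
# data-movement costs between adjacent ops placed on different devices.

def assign_hardware(ops, costs, objective="latency"):
    """Pick one realization per op minimizing the chosen objective.

    ops       : list of op names, e.g. ["matmul", "softmax"]
    costs     : {op: {hw: (latency, energy)}}
    objective : "latency" or "energy"
    """
    idx = 0 if objective == "latency" else 1
    assignment = {}
    for op in ops:
        assignment[op] = min(costs[op], key=lambda hw: costs[op][hw][idx])
    return assignment

# Made-up costs reflecting the abstract's premise: photonic hardware (PTU)
# is energy-cheap for linear ops but a poor fit for nonlinear ones.
costs = {
    "matmul":  {"ptu": (1.0, 0.2), "gpu": (0.8, 1.0)},
    "softmax": {"ptu": (9.0, 5.0), "gpu": (0.5, 0.6)},
}
print(assign_hardware(["matmul", "softmax"], costs, "energy"))
# -> {'matmul': 'ptu', 'softmax': 'gpu'}
print(assign_hardware(["matmul", "softmax"], costs, "latency"))
# -> {'matmul': 'gpu', 'softmax': 'gpu'}
```

Note how the two objectives already produce distinct mappings on this toy instance, mirroring finding (iii) above.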
MiCo: End-to-End Mixed Precision Neural Network Co-Exploration Framework for Edge AI
Quantized Neural Networks (QNNs) with extremely low-bitwidth data have proven promising for efficient storage and computation on edge devices. To further reduce the accuracy drop while increasing speedup, layer-wise mixed-precision quantization (MPQ) has become a popular solution. However, existing algorithms for exploring MPQ schemes are limited in flexibility and efficiency. Comprehending the complex impacts of different MPQ schemes on post-training quantization and quantization-aware training results is a challenge for conventional methods. Furthermore, an end-to-end framework for the optimization and deployment of MPQ models is missing in existing work. In this paper, we propose the MiCo framework, a holistic MPQ exploration and deployment framework for edge AI applications. The framework adopts a novel optimization algorithm to search for optimal quantization schemes with the highest accuracies while meeting latency constraints. Hardware-aware latency models are built for different hardware targets to enable fast exploration. After the exploration, the framework enables direct deployment from PyTorch MPQ models to bare-metal C code, leading to end-to-end speedup with minimal accuracy drops. Tiny machine learning (ML) and edge artificial intelligence (AI) are becoming increasingly important and valuable in today's AI ecosystem. However, deploying AI models on edge devices is challenging due to the tight resource constraints.
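The search problem described above can be sketched in a few lines: choose a per-layer bitwidth that satisfies a latency budget (from a hardware-aware latency model) while paying the smallest accuracy cost. This is a generic greedy illustration, not MiCo's actual algorithm; the layer names, latencies, and accuracy-drop numbers are all made up.

```python
# Hypothetical layer-wise mixed-precision search under a latency budget.
# Each layer may run at 8, 4, or 2 bits; lower bitwidths are faster but
# cost accuracy. Greedily lower whichever layer's bitwidth is cheapest
# in accuracy until the latency budget is met.

BITS = (8, 4, 2)

def mpq_search(latency, acc_drop, budget):
    """latency  : {layer: {bits: ms}}    hardware-aware latency model
       acc_drop : {layer: {bits: drop}}  estimated accuracy drop vs. fp32
       budget   : total latency budget (ms)"""
    scheme = {layer: 8 for layer in latency}          # start at 8-bit everywhere
    def total(s): return sum(latency[l][b] for l, b in s.items())
    while total(scheme) > budget:
        # Candidate moves: drop one layer to the next lower bitwidth.
        moves = [(acc_drop[l][BITS[BITS.index(b) + 1]] - acc_drop[l][b], l)
                 for l, b in scheme.items() if b != 2]
        if not moves:
            raise ValueError("budget infeasible even at 2-bit everywhere")
        _, layer = min(moves)                         # cheapest accuracy hit
        scheme[layer] = BITS[BITS.index(scheme[layer]) + 1]
    return scheme

latency  = {"conv1": {8: 4, 4: 2, 2: 1}, "fc": {8: 6, 4: 3, 2: 2}}
acc_drop = {"conv1": {8: 0.1, 4: 0.5, 2: 2.0}, "fc": {8: 0.1, 4: 0.2, 2: 1.5}}
print(mpq_search(latency, acc_drop, budget=6))
# -> {'conv1': 4, 'fc': 4}
```

A swapped-in latency model per target is what makes the exploration "hardware-aware": the same accuracy estimates yield different schemes on different devices.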
Simplify deploying YOLOv5 using the new OctoML CLI
Follow along with our new YOLOv5 deployment tutorial to power your next object detection application. Or, watch this tutorial video by Smitha Kolan on how to deploy YOLOv5 in under 15 minutes using the OctoML CLI. Today, we are excited to announce the results of our collaboration with Ultralytics to deploy the YOLOv5 models to over 100 CPU and GPU hardware targets in AWS, Azure and GCP. Our engineering work with Ultralytics unlocks the ability to deploy YOLOv5 models on hardware from Intel, NVIDIA, Arm and AWS, with minimal effort and cost. In this blog, I'll show you how simple it is to achieve hardware independence and cost savings across multiple clouds.
The Next Big Programming Language You've Never Heard Of
At the International Conference on Programming Language Design and Implementation (2022), scientists from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) published a research paper titled, 'Exocompilation for productive programming of hardware accelerators' that proposes a new programming language, 'Exo', which can be used for writing high-performance code on hardware accelerators. Exo is a domain-specific programming language that helps low-level performance engineers transform very simple programs which specify what they want to compute into very complex programs that do the same thing as the specification but much faster. It is both a programming language and a compiler and allows custom hardware instructions, specialised memories and accelerator configuration states to be defined in user libraries. Exo builds on the idea of user scheduling to externalise hardware mapping and optimisation decisions. Accelerators like GPUs and image signal processors play an increasingly important role in modern computer systems.
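The core idea of user scheduling can be shown in miniature: start from a simple specification of what to compute, then apply semantics-preserving rewrites that change how it runs without changing what it means. The sketch below is plain Python, not Exo's actual API; it only illustrates the kind of loop-split transform a performance engineer would script.

```python
# Illustration of user scheduling (NOT Exo's real syntax): a simple spec,
# then the same computation after a "split" rewrite on the inner loop --
# the rewrite changes performance characteristics, never the result.

def spec_matvec(A, x):
    """Specification: straightforward matrix-vector product."""
    n, m = len(A), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y

def scheduled_matvec(A, x, tile=4):
    """Same computation after splitting the j loop into tiles, the kind
    of transform that maps onto vector units or accelerator memories."""
    n, m = len(A), len(x)
    y = [0.0] * n
    for i in range(n):
        for jo in range(0, m, tile):                  # outer tile loop
            for j in range(jo, min(jo + tile, m)):    # inner tile loop
                y[i] += A[i][j] * x[j]
    return y

A = [[1, 2, 3], [4, 5, 6]]
x = [1, 1, 1]
assert spec_matvec(A, x) == scheduled_matvec(A, x)    # rewrite preserves meaning
```

In Exo itself, such rewrites are applied by the compiler under user direction, and the hardware-specific instructions and memories they target are defined in user libraries rather than baked into the compiler.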
AI design changes on the horizon from open-source Apache TVM and OctoML
In recent years, artificial intelligence programs have been prompting changes in computer chip designs, and novel computers have made new kinds of neural networks in AI possible. There is a powerful feedback loop going on. In the center of that loop sits software technology that converts neural net programs to run on novel hardware. And at the center of that sits a recent open-source project gaining momentum. Apache TVM is a compiler that operates differently from other compilers.
'Octomize' Your ML Code
If you're spending months hand-tuning your machine learning model to run well on a particular type of processor, you might be interested in a startup called OctoML, which recently raised $28 million to bring its innovative "Octomizer" to market. Octomizer is the commercial version of Apache TVM, an open source compiler that was created in Professor Luiz Ceze's research project in the Computer Science Department at the University of Washington. Datanami recently caught up with the professor, who is also the CEO of OctoML, to learn about the state of machine learning model compilation in a rapidly changing hardware world. According to Ceze, there is a big gap in the MLOps workflow between the completion of the machine learning model by the data scientist or machine learning engineer, and deployment of that model into the real world. Quite often, the services of a software engineer are required to convert the ML model, which is often written in Python using one of the popular frameworks like TensorFlow or PyTorch, into highly optimized C or C++ that can run on a particular processor.
plaidml/plaidml
This will act as our development branch going forward and will allow us to more rapidly prototype the changes we're making without breaking our existing user base. As a precaution, please note that certain features, tests, and hardware targets may be broken in plaidml-v1. You can continue to use code on the master branch or from our releases on PyPI. For your convenience, the contents of our master branch will be released as version 0.7.0. We are keeping the master branch of PlaidML stable and maintaining it until plaidml-v1 is ready for production.
pytorch/glow
Glow is a machine learning compiler and execution engine for various hardware targets. It is designed to be used as a backend for high-level machine learning frameworks. The compiler is designed to allow state-of-the-art compiler optimizations and code generation of neural network graphs. This library is experimental and in active development. Glow lowers a traditional neural network dataflow graph into a two-phase strongly-typed intermediate representation (IR).
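The lowering step described above can be illustrated with a toy rewrite: a high-level graph node is expanded into simpler primitive operations before further optimization. The node names and the single rewrite rule below are simplified stand-ins, not Glow's actual IR.

```python
# Toy illustration of graph lowering (not Glow's real IR): a high-level
# FullyConnected node is rewritten into primitive linear-algebra ops.

def lower(node):
    """Lower one high-level node into a list of primitive instructions."""
    op, args = node
    if op == "FullyConnected":        # FC(x, W, b) -> MatMul, then Add
        x, w, b = args
        return [("MatMul", (x, w), "t0"), ("Add", ("t0", b), "out")]
    return [node]                     # already primitive: pass through

graph = [("FullyConnected", ("x", "W", "b"))]
ir = [instr for node in graph for instr in lower(node)]
print(ir)
# -> [('MatMul', ('x', 'W'), 't0'), ('Add', ('t0', 'b'), 'out')]
```

Lowering a rich node vocabulary into a small primitive set is what lets the later, strongly-typed phase apply generic optimizations and per-target code generation without knowing about every framework-level operator.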